Federated Evaluation


FedEval-LLM: Federated Evaluation of Large Language Models on Downstream Tasks with Collective Wisdom

He, Yuanqin, Kang, Yan, Fan, Lixin, Yang, Qiang

arXiv.org Artificial Intelligence

Federated Learning (FL) has emerged as a promising solution for the collaborative training of large language models (LLMs). However, integrating LLMs into FL introduces new challenges, particularly for evaluation. Traditional evaluation methods that rely on labeled test sets and similarity-based metrics cover only a subset of the acceptable answers and therefore fail to accurately reflect the performance of LLMs on generative tasks. Meanwhile, automatic evaluation methods that leverage advanced LLMs show promise, but they carry critical risks of data leakage, since data must be transmitted to external servers, and perform suboptimally on downstream tasks for lack of domain knowledge. To address these issues, we propose a Federated Evaluation framework for Large Language Models, named FedEval-LLM, that provides reliable performance measurements of LLMs on downstream tasks without relying on labeled test sets or external tools, thereby ensuring strong privacy preservation. FedEval-LLM leverages a consortium of participants' personalized LLMs as referees, which contribute domain knowledge and collective evaluation capability, aligning the evaluation with the respective downstream tasks and mitigating the uncertainty and bias associated with a single referee. Experimental results demonstrate a significant improvement in the evaluation capability of personalized evaluation models on downstream tasks. When applied to FL, these evaluation models exhibit strong agreement with human preference and with Rouge-L scores on carefully curated test sets. FedEval-LLM effectively overcomes the limitations of traditional metrics and the reliance on external services, making it a promising framework for evaluating LLMs in collaborative training scenarios.
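The core idea of a referee consortium can be illustrated with a minimal sketch (not the paper's implementation): several local judge models each score a candidate answer independently, and the scores are averaged so that no single referee's bias dominates. The referee functions and scores below are made-up placeholders.

```python
# Sketch of collective evaluation by a consortium of referees.
# Each referee is a stand-in for a personalized judge LLM that
# returns a quality score in [0, 1] for a (question, answer) pair.
from statistics import mean

def referee_a(question, answer):
    return 0.8  # hypothetical domain-expert judge

def referee_b(question, answer):
    return 0.6  # hypothetical second judge

def referee_c(question, answer):
    return 0.7  # hypothetical third judge

def collective_score(question, answer, referees):
    """Average the independent scores of all participating referees."""
    scores = [judge(question, answer) for judge in referees]
    return mean(scores)

score = collective_score("Q", "A", [referee_a, referee_b, referee_c])
print(round(score, 2))  # 0.7
```

Averaging is only one possible aggregation rule; majority voting or rank-based aggregation would fit the same interface.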


A Survey of Federated Evaluation in Federated Learning

Soltani, Behnaz, Zhou, Yipeng, Haghighi, Venus, Lui, John C. S.

arXiv.org Artificial Intelligence

In traditional machine learning, model evaluation is trivial to conduct since all data samples are managed centrally by a server. However, model evaluation becomes a challenging problem in federated learning (FL), which we call federated evaluation in this work, because clients do not expose their original data in order to preserve data privacy. Federated evaluation plays a vital role in client selection, incentive mechanism design, and malicious attack detection, among other applications. In this paper, we provide the first comprehensive survey of existing federated evaluation methods. Moreover, we explore various applications of federated evaluation for enhancing FL performance and finally present future research directions by envisioning some challenges.


MedPerf: Open Benchmarking Platform for Medical Artificial Intelligence using Federated Evaluation

Karargyris, Alexandros, Umeton, Renato, Sheller, Micah J., Aristizabal, Alejandro, George, Johnu, Bala, Srini, Beutel, Daniel J., Bittorf, Victor, Chaudhari, Akshay, Chowdhury, Alexander, Coleman, Cody, Desinghu, Bala, Diamos, Gregory, Dutta, Debo, Feddema, Diane, Fursin, Grigori, Guo, Junyi, Huang, Xinyuan, Kanter, David, Kashyap, Satyananda, Lane, Nicholas, Mallick, Indranil, Mascagni, Pietro, Mehta, Virendra, Natarajan, Vivek, Nikolov, Nikola, Padoy, Nicolas, Pekhimenko, Gennady, Reddi, Vijay Janapa, Reina, G Anthony, Ribalta, Pablo, Rosenthal, Jacob, Singh, Abhishek, Thiagarajan, Jayaraman J., Wuest, Anna, Xenochristou, Maria, Xu, Daguang, Yadav, Poonam, Rosenthal, Michael, Loda, Massimo, Johnson, Jason M., Mattson, Peter

arXiv.org Artificial Intelligence

Medical AI has tremendous potential to advance healthcare by supporting the evidence-based practice of medicine, personalizing patient treatment, reducing costs, and improving provider and patient experience. We argue that unlocking this potential requires a systematic way to measure the performance of medical AI models on large-scale heterogeneous data. To meet this need, we are building MedPerf, an open framework for benchmarking machine learning in the medical domain. MedPerf will enable federated evaluation in which models are securely distributed to different facilities for evaluation, thereby empowering healthcare organizations to assess and verify the performance of AI models in an efficient and human-supervised process, while prioritizing privacy. We describe the current challenges healthcare and AI communities face, the need for an open platform, the design philosophy of MedPerf, its current implementation status, and our roadmap. We call for researchers and organizations to join us in creating the MedPerf open benchmarking platform.
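The federated-evaluation pattern described above can be sketched in a few lines, under the assumption that each facility runs the model on its own labeled data and returns only summary metrics. The facility names, toy model, and data below are illustrative, not part of MedPerf.

```python
# Sketch of federated evaluation: the model travels to each facility,
# metrics are computed locally, and only aggregates leave the site.

def evaluate_locally(model, local_data):
    """Run the model on one facility's labeled data; return accuracy only."""
    correct = sum(1 for x, y in local_data if model(x) == y)
    return correct / len(local_data)

def federated_evaluate(model, facilities):
    """Collect per-facility metrics; raw patient data never leaves a site."""
    return {name: evaluate_locally(model, data)
            for name, data in facilities.items()}

# Toy model and per-facility (input, label) pairs, purely hypothetical.
model = lambda x: x % 2
facilities = {
    "hospital_a": [(1, 1), (2, 0), (3, 1), (4, 1)],
    "hospital_b": [(5, 1), (6, 0)],
}
print(federated_evaluate(model, facilities))
# {'hospital_a': 0.75, 'hospital_b': 1.0}
```

Reporting per-facility metrics rather than a single pooled number also surfaces performance disparities across heterogeneous sites, which is a stated goal of large-scale benchmarking.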


Federated Evaluation of On-device Personalization

Wang, Kangkang, Mathews, Rajiv, Kiddon, Chloé, Eichner, Hubert, Beaufays, Françoise, Ramage, Daniel

arXiv.org Machine Learning

Federated learning is a distributed, on-device computation framework that enables training global models without exporting sensitive user data to servers. In this work, we describe methods to extend the federation framework to evaluate strategies for personalization of global models. We present tools to analyze the effects of personalization and evaluate conditions under which personalization yields desirable models. We report on our experiments personalizing a language model for a virtual keyboard for smartphones with a population of tens of millions of users. We show that a significant fraction of users benefit from personalization.
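The headline measurement, the fraction of users who benefit from personalization, can be sketched as a simple per-user before/after comparison. The accuracy numbers below are invented for illustration; in the federated setting each comparison would be computed on-device and only the aggregate would be reported.

```python
# Sketch: measure what fraction of users improve after on-device
# personalization, comparing a per-user metric before and after.

def fraction_benefiting(baseline, personalized):
    """Fraction of users whose metric strictly improved after personalization."""
    improved = sum(1 for b, p in zip(baseline, personalized) if p > b)
    return improved / len(baseline)

# Hypothetical per-user accuracies for the global vs. personalized model.
baseline_acc     = [0.60, 0.72, 0.55, 0.80]
personalized_acc = [0.66, 0.70, 0.61, 0.83]
print(fraction_benefiting(baseline_acc, personalized_acc))  # 0.75
```

Using a strict inequality counts only users who genuinely improve; a tolerance threshold could be added to ignore noise-level changes.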